Visual Grounding



SimVG: A Simple Framework for Visual Grounding with Decoupled Multi-modal Fusion

Neural Information Processing Systems

Visual grounding is a common vision task that involves localizing the image regions described by a given sentence. Most existing methods use independent image-text encoding and apply complex hand-crafted modules or encoder-decoder architectures for modal interaction and query reasoning.
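As a rough, self-contained sketch of the two-stream pattern the abstract describes (independent image and text encoding followed by a hand-crafted interaction module for query reasoning), the PyTorch snippet below uses simple linear projections as stand-in encoders, cross-attention as the interaction module, and a small MLP as the box head. Every module name, feature dimension, and the fusion scheme itself is an assumption made for illustration; this is not SimVG's actual architecture.

# Illustrative two-stream visual grounding sketch: independent encoders,
# a cross-attention interaction module, and a box-regression head.
# All names and dimensions are assumptions, not SimVG's design.
import torch
import torch.nn as nn


class TwoStreamGrounder(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        # Stand-in encoders: a real system would use a pretrained CNN/ViT and a
        # language model; simple projections keep the sketch runnable.
        self.image_proj = nn.Linear(2048, dim)   # per-region visual features
        self.text_proj = nn.Linear(768, dim)     # per-token language features
        # Hand-crafted interaction module: text tokens attend to image regions.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.box_head = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 4))

    def forward(self, region_feats: torch.Tensor, token_feats: torch.Tensor) -> torch.Tensor:
        v = self.image_proj(region_feats)        # (B, R, dim)
        t = self.text_proj(token_feats)          # (B, T, dim)
        fused, _ = self.cross_attn(t, v, v)      # query reasoning over regions
        # Pool the fused query tokens and predict a normalized (cx, cy, w, h) box.
        return self.box_head(fused.mean(dim=1)).sigmoid()


# Tiny smoke test with random features.
model = TwoStreamGrounder()
boxes = model(torch.randn(2, 36, 2048), torch.randn(2, 12, 768))
print(boxes.shape)  # torch.Size([2, 4])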



CityRefer Datasheet

Neural Information Processing Systems

We follow the guidelines of the datasheets for datasets framework [1] to explain the composition, collection, recommended use cases, and other details of the CityRefer dataset.

For what purpose was the dataset created? We created the CityRefer dataset to facilitate research toward city-scale 3D visual grounding.

Who created the dataset (e.g., which team, research group) and on behalf of which entity? Who funded the creation of the dataset?

What do the instances that comprise the dataset represent? CityRefer contains descriptions for 3D visual grounding on large-scale point cloud data.
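For concreteness, a purely hypothetical sketch of what a single grounding instance in a dataset of this kind could look like is given below; the field names and example values are assumptions for illustration and are not taken from the CityRefer release.

# Hypothetical instance layout for a city-scale 3D grounding dataset.
# Field names and values are illustrative assumptions only.
from dataclasses import dataclass


@dataclass
class GroundingInstance:
    scene_id: str          # identifier of the city-scale point-cloud scene
    object_id: int         # identifier of the referred object within the scene
    object_name: str       # coarse category label of the referred object
    description: str       # natural-language description to be grounded


example = GroundingInstance(
    scene_id="city_scene_0001",
    object_id=17,
    object_name="building",
    description="the tall building next to the river, across from the parking lot",
)
print(example.description)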






Exploiting Contextual Objects and Relations for 3D Visual Grounding

Neural Information Processing Systems

3D visual grounding is challenging because it requires capturing 3D contextual information to distinguish target objects within complex 3D scenes. The absence of annotations for contextual objects and relations further exacerbates the difficulty. In this paper, we propose a novel model, CORE-3DVG, to address these challenges by explicitly learning about contextual objects and relations. Our method accomplishes 3D visual grounding via three sequential modular networks: a text-guided object detection network, a relation matching network, and a target identification network. During training, we introduce a pseudo-label self-generation strategy and a weakly supervised method to facilitate the learning of contextual objects and relations, respectively. These techniques allow the networks to focus more effectively on referred objects within 3D scenes by better understanding their context. We validate our model on the challenging Nr3D, Sr3D, and ScanRefer datasets and demonstrate state-of-the-art performance.
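A hedged sketch of the three-stage pipeline outlined above (text-guided object detection, relation matching, target identification) follows. The interfaces, the bilinear relation scorer, and the way the stages are fused are assumptions chosen to keep the example short and runnable; this is not the CORE-3DVG implementation.

# Illustrative three-stage grounding pipeline; all design choices are assumptions.
import torch
import torch.nn as nn


class ThreeStageGrounder(nn.Module):
    def __init__(self, dim: int = 128):
        super().__init__()
        # Stage 1: text-guided object detection (here, text-conditioned refinement).
        self.detect = nn.Linear(dim * 2, dim)
        # Stage 2: relation matching between pairs of candidate objects.
        self.relate = nn.Bilinear(dim, dim, 1)
        # Stage 3: target identification from appearance and context cues.
        self.identify = nn.Linear(dim + 1, 1)

    def forward(self, obj_feats: torch.Tensor, text_feat: torch.Tensor) -> torch.Tensor:
        # obj_feats: (N, dim) candidate object features; text_feat: (dim,) sentence feature.
        n = obj_feats.size(0)
        text = text_feat.unsqueeze(0).expand(n, -1)
        # Stage 1: refine candidates conditioned on the description.
        cand = torch.relu(self.detect(torch.cat([obj_feats, text], dim=-1)))
        # Stage 2: score pairwise relations and summarize each object's context.
        left = cand.unsqueeze(1).expand(n, n, -1).reshape(n * n, -1)
        right = cand.unsqueeze(0).expand(n, n, -1).reshape(n * n, -1)
        context = self.relate(left, right).view(n, n).mean(dim=1, keepdim=True)
        # Stage 3: fuse appearance and context into a per-object target score.
        return self.identify(torch.cat([cand, context], dim=-1)).squeeze(-1)


# Tiny smoke test: scores over 5 candidate objects.
scores = ThreeStageGrounder()(torch.randn(5, 128), torch.randn(128))
print(scores.shape)  # torch.Size([5])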


Look Around and Refer: 2D Synthetic Semantics Knowledge Distillation for 3D Visual Grounding

Neural Information Processing Systems

The main question we address is: can we consolidate the 3D visual stream with 2D clues and use them efficiently in both the training and testing phases? The main idea is to assist the 3D encoder by incorporating rich 2D object representations without requiring extra 2D inputs. To this end, we leverage 2D clues synthetically generated from 3D point clouds, which we empirically show boost the quality of the learned visual representations.
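A minimal sketch of the distillation idea, under the assumption that it can be expressed as a feature-alignment loss between the 3D encoder and a 2D teacher computed from synthetically rendered views, is shown below; the cosine-based loss and the feature dimensions are illustrative, not the paper's actual objective.

# Illustrative 2D-to-3D feature distillation loss; used only during training,
# so no extra 2D input is needed at test time. Loss form is an assumption.
import torch
import torch.nn.functional as F


def distillation_loss(feat_3d: torch.Tensor, feat_2d: torch.Tensor) -> torch.Tensor:
    """Encourage per-object 3D features to match their 2D counterparts.

    feat_3d: (N, D) features from the 3D encoder for N candidate objects.
    feat_2d: (N, D) features extracted from synthetic 2D renderings (teacher,
             detached so gradients only flow into the 3D stream).
    """
    feat_3d = F.normalize(feat_3d, dim=-1)
    feat_2d = F.normalize(feat_2d.detach(), dim=-1)
    # 1 - cosine similarity, averaged over objects.
    return (1.0 - (feat_3d * feat_2d).sum(dim=-1)).mean()


# Example: the distillation term would be added to the grounding loss during training.
loss = distillation_loss(torch.randn(8, 256, requires_grad=True), torch.randn(8, 256))
loss.backward()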